Development and Evaluation of an Italian Broadcast News Corpus
نویسندگان
چکیده
This paper reports on the development and evaluation of an Italian broadcast news corpus at ITC-irst, under a contract with the European Language resources Distribution Agency (ELDA). The corpus consists of 30 hours of recordings transcribed and annotated with conventions similar to those adopted by the Linguistic Data Consortium for the DARPA HUB-4 corpora. The corpus will be completed and released to ELDA by April 2000.
منابع مشابه
Advances in automatic transcription of Italian broadcast news
This paper presents some recent improvements in automatic transcription of Italian broadcast news obtained at ITCirst. A first preliminary activity was carried out in order to develop a suitable speech corpus for the Italian language. The resulting corpus, formed by recordings covering 30 hours of radio news, was exploited for developing a baseline system for transcription of broadcast news. Th...
متن کاملA baseline for the transcription of Italian broadcast news
This paper presents the first achievements in the development of a broadcast news transcription system to be applied for the processing of huge audio archives. In particular, the Italian broadcast news corpus under collection is introduced, and the first implemented baseline system is outlined. The baseline system consists of an audio segmentation module and a speech recognizer featuring a recu...
متن کاملSpeaker tracking in a broadcast news corpus
Speaker tracking is the process of following who says something in an audio stream. In the case the audio stream is a recording of broadcast news, speaker identity can be an important meta-data for building digital libraries. Moreover, the segmentation and classification of the audio stream in terms of acoustic contents, bandwidth and speaker gender allow to filter out portions of the signal wh...
متن کاملTUKE-BNews-SK: Slovak Broadcast News Corpus Construction and Evaluation
This article presents an overview of the existing acoustical corpuses suitable for broadcast news automatic transcription task in the Slovak language. The TUKE-BNews-SK database created in our department was built to support the application development for automatic broadcast news processing and spontaneous speech recognition of the Slovak language. The audio corpus is composed of 479 Slovak TV...
متن کاملMatbn 2002: a Mandarin Chinese Broadcast News Corpus
The MATBN 2002 Mandarin Chinese broadcast news corpus contains a total of 40 hours of broadcast news from Public Television Service Foundation (Taiwan) with corresponding transcripts. The primary motivation for this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast domain. We expect to collect and process 220 hours of Mandarin Chine...
متن کامل